\section{Experimental Setup}
\label{sec:setup}

The data sets used in our experiments are described in Table~\ref{distribution}.
In all experimental settings, the CoNLL-2003\footnote{The English portion of the CoNLL-2003
Named Entity Recognition data set, whose source data was taken from
Reuters newswire articles \cite{TjongKimSang:Buchholz:2000}.} (\Conll) training set is always used as the \textit{source corpus}. As target corpora, we use MUC-7\footnote{\url{https://catalog.ldc.upenn.edu/LDC2001T02}} (\MUC) and i2b2 (\Ib) \cite{abacha:2011}, both of which provide standard training and test splits. For these two corpora we keep the official test set, randomly sample 10\% of the sentences in the provided training set as the validation set, and use the remaining sentences as the training set. We also use the BBN\footnote{\url{https://catalog.ldc.upenn.edu/LDC2005T33}} (\BBN) corpus as a target. \BBN provides no standard splits, so we randomly select 20\% of the documents as the test set; from the remaining 80\% of the documents, we sample 10\% of the sentences as the validation set and use the remaining sentences as the training set.
Entity types with fewer than 10 instances in the validation set were discarded (the list of \BBN entity types that were not considered is available at \url{ourexternalrepos}).
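
For concreteness, the following minimal sketch illustrates the splitting procedure described above (Python; the sentence and document data structures, the fixed random seed, and the helper names are illustrative assumptions, not part of the released corpora):

\begin{small}
\begin{verbatim}
import random
from collections import Counter

random.seed(0)  # illustrative fixed seed

def split_muc_i2b2(train_sentences):
    # MUC-7 / i2b2: keep the official test set; sample 10% of the
    # provided training sentences as the validation set.
    shuffled = list(train_sentences)
    random.shuffle(shuffled)
    n_dev = int(0.10 * len(shuffled))
    return shuffled[n_dev:], shuffled[:n_dev]          # train, validation

def split_bbn(documents):
    # BBN: 20% of documents -> test; 10% of the remaining
    # sentences -> validation; the rest -> training.
    docs = list(documents)
    random.shuffle(docs)
    n_test = int(0.20 * len(docs))
    test_docs, rest_docs = docs[:n_test], docs[n_test:]
    sents = [s for d in rest_docs for s in d]
    random.shuffle(sents)
    n_dev = int(0.10 * len(sents))
    return sents[n_dev:], sents[:n_dev], test_docs     # train, validation, test

def kept_types(validation_sentences, min_count=10):
    # Sentences are assumed to be lists of (token, label) pairs;
    # entity types with fewer than `min_count` validation instances
    # are discarded.
    counts = Counter(label for sent in validation_sentences
                     for _, label in sent if label != "O")
    return {t for t, c in counts.items() if c >= min_count}
\end{verbatim}
\end{small}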

\begin{table*}
\begin{center}
\begin{small}
\caption{Corpora description}
\label{distribution}
\begin{tabular}{c|c|c|l|r}
\hline
\textbf{Corpus} & \textbf{Entity types} & \textbf{Domain} & \textbf{NE type} & \textbf{Instances} \\ \hline
\multirow{4}{*}{\Conll} & & & \location & 10645 \\
&  4 & Newswire articles &  \misc & 5062 \\
& & & \org & 9323 \\
& & & \person & 10059 \\ \hline 
 
\multirow{8}{*}{\Ib} &  &  & \age & 16 \\ 
& & & \dates & 7098 \\ 
& & & \doctor & 3751  \\ 
& 9 & Medical reports & \hospital & 2400 \\ 
& & & \ids & 4809 \\  
& & & \location & 4809 \\  
& & & \patient & 929  \\
& & & \phonenumber & 232 \\  \hline 
 
\multirow{49}{*}{\BBN} & &  & \animal & 396  \\
& & & \cardinal & 10227   \\ 
& & & \contactPhone & 23   \\ 
& & & \dateAge & 607   \\ 
& & & \dateDate & 16035   \\ 
& & & \dateDuration & 3116   \\ 
& & & \dateOther & 784   \\ 
& & & \disease & 317   \\ 
& & & \eventOther &  219  \\ 
& & & \facBuilding & 154  \\ 
& & & \facOther & 87  \\ 
& & & \facStreet & 116  \\ 
& & & \gpeCity & 5562 \\
& & & \gpeDescCity & 376 \\
& & & \gpeDescCountry &  980 \\
& & & \gpeDescOther &  69 \\
& & & \gpeDescState & 397  \\
& & & \gpeCountry & 5038 \\
& & & \gpeOther  &  182 \\
& & & \gpeState & 2688 \\
& & & \law & 380   \\ 
& & & \continent & 248  \\ 
& & & \lake & 78  \\ 
& & & \locOther & 181  \\ 
& & & \region & 520  \\ 
& 51 & Newswire articles & \money & 11065   \\ 
& & & \norpNationality &  3200  \\
& & & \norpPolitical &  675  \\
& & & \norpReligion & 87   \\
& & & \ordinal & 1091   \\ 
& & & \orgCoorp & 23323  \\  
& & & \orgGov & 4598  \\  
& & & \orgDescCoorp & 15107   \\
& & & \orgDescGov & 2493   \\ 
& & & \orgDescOther & 1191   \\
& & & \orgEducational & 365   \\ 
& & & \orgPolitical & 413   \\ 
& & & \percent &  5931   \\ 
& & & \person & 13675   \\ 
& & & \prodDescVehicle  & 1189 \\
& & & \prodVehicle & 380 \\ 
& & & \prodOther & 517   \\ 
& & & \quantity & 185 \\ % 1D
& & & \substanceChe & 531 \\ 
& & & \substanceFood & 884 \\
& & & \substanceOther & 848  \\ 
& & & \ti &  1095  \\ 
& & & \artOther & 515 \\
& & & \artSong & 39 \\  \hline 

\multirow{9}{*}{\MUC} &  &  & \dates & 2422 \\ 
& & & \location & 2890 \\
& & & \money & 332 \\  
& & & \org & 3985 \\
& 9 & Newswire articles & \person & 2156 \\
& & & \percent & 134 \\ 
& & & \ti & 488 \\
& & & \dates $|$ \ti & 2 \\
& & & \org $|$ \location & 2 \\ \hline  
\end{tabular}
\end{small}
\end{center}
\end{table*}

 % IDs: refers to any combination of numbers, letters, and special characters identifying medical records, patients, doctors, or hospitals, e.g., Provider Number: [12344

As shown in Table~\ref{distribution}, \BBN, \MUC and \Ib contain more entity types than \Conll. The \BBN corpus covers almost all the types considered in the other data sets, except \location and \org. The entity types shared by all data sets are \person, \location and \org.



\subsection{Baselines}

We compare our proposal against the following baselines:

\begin{enumerate}
\item Linear-chain CRF with a bag-of-words representation.
\item Linear-chain CRF with word embeddings trained with skip-gram.
\item Linear-chain CRF with skip-gram word embeddings plus a binary feature extracted from the Stanford NER model: for each token in a window of two tokens to the left and two to the right, the feature is set to true if that token has been identified as a Named Entity, and false otherwise (see the sketch after this list).
%given that multi-word Named Entities are very frequent and Named Entities are sometimes preceded by markers such as Sr., Dr., among others.
%\item Linear-chain CRF with word embeddings trained with skip-gram, which is trained with a combination of a target training corpus and a source corpus.
\item Linear-chain CRF with skip-gram word embeddings, trained on a combination of the target training corpus and a random sample of the source corpus of the same size as the target training corpus.
%\item All above baselines with features specific to new named entity types.
\end{enumerate}
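
To make baseline 3 concrete, the following sketch shows how the Stanford NER indicator could be turned into window features for the CRF (Python; the function and variable names are ours and merely illustrative):

\begin{small}
\begin{verbatim}
def ner_window_features(tokens, stanford_is_ne, window=2):
    # stanford_is_ne[i] is True iff the Stanford NER model tagged
    # token i as a Named Entity; one binary feature per offset in
    # [-window, +window] is added for every position.
    features = []
    for i in range(len(tokens)):
        feats = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                feats["stanford_ne[%+d]" % offset] = stanford_is_ne[j]
        features.append(feats)
    return features
\end{verbatim}
\end{small}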

\subsection{Variations of Our Method}
We evaluate both two-layer graph transformers (without hidden layers) and deep graph transformers on the target corpora under the following settings:
\begin{enumerate}
\item Random NE type matching.
\item Manual NE type matching.
\item Closest class weight vector matching.
\item Word embedding cluster center matching (see the sketch after this list).
\item All of the above systems with features specific to the new named entity types.
\end{enumerate}
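
As an illustration of setting 4, the sketch below matches each new NE type to the old type whose mention-embedding centroid is closest in cosine similarity (Python; the dictionary-of-vectors interface is an assumption made only for this example):

\begin{small}
\begin{verbatim}
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def match_by_cluster_center(new_type_vecs, old_type_vecs):
    # *_type_vecs: dict mapping an NE type to the list of word-embedding
    # vectors of its mentions (illustrative data structure).
    old_centers = {t: np.mean(v, axis=0) for t, v in old_type_vecs.items()}
    mapping = {}
    for new_t, vecs in new_type_vecs.items():
        center = np.mean(vecs, axis=0)
        mapping[new_t] = max(old_centers,
                             key=lambda t: cosine(center, old_centers[t]))
    return mapping
\end{verbatim}
\end{small}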
In addition, for the best-performing deep model, we study the influence of architecture depth by incrementally adding hidden layers.

\subsection{Evaluation Methods}
As Table~\ref{table:train_test_corpora} shows, we run several experiments with various combinations of training and test corpora in the target domain. Each target training set is split into 10 partitions whose sizes grow on a logarithmic scale, and 10 nested training sets are built by cumulatively merging these partitions from the smallest to the largest.
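
A minimal sketch of this cumulative, logarithmically scaled splitting is given below (Python; the exact size schedule is an assumption, only the nesting behaviour matters):

\begin{small}
\begin{verbatim}
def cumulative_log_splits(sentences, n_splits=10):
    # Returns 10 nested training sets whose sizes grow roughly
    # geometrically, i.e. they are evenly spaced on a log scale.
    n = len(sentences)
    sizes = [max(1, int(round(n ** ((i + 1) / n_splits))))
             for i in range(n_splits)]
    sizes[-1] = n
    return [sentences[:s] for s in sizes]
\end{verbatim}
\end{small}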


% [htdp]
\begin{table}
\begin{center}
\begin{tabular}{ccc}\hline
\textbf{Training} & \textbf{Source} &  \textbf{Test} \\ \hline
\MUC training & \Conll training & \MUC \\
\BBN training & \Conll training & \BBN \\
\Ib training & \Conll training & \Ib\\ \hline
\end{tabular}
\end{center}
\caption{All combinations of training, test and source corpora in our experiments}
\label{table:train_test_corpora}
\end{table}

For each corpus combination, we compare all baselines against our model variations, which employ different type matching schemes (Section~\ref{sec:matching}) between the old and the new types.

For all methods, we report the micro-averaged F1-measure. In addition, we also report the micro-averaged F1-measure for each relationship category between known and new NE types (Section~\ref{sec:DNNSeq}).
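For reference, the micro-averaged F1-measure pools true positives (TP), false positives (FP) and false negatives (FN) over all NE types $c$ under consideration before computing precision and recall:
\[
P = \frac{\sum_{c}\mathrm{TP}_c}{\sum_{c}\left(\mathrm{TP}_c+\mathrm{FP}_c\right)},\qquad
R = \frac{\sum_{c}\mathrm{TP}_c}{\sum_{c}\left(\mathrm{TP}_c+\mathrm{FN}_c\right)},\qquad
F1 = \frac{2PR}{P+R}.
\]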
%As mentioned in Section~\ref{sec:DNNSeq}, we have assumed there is a relationship between the labels in the source and target corpora. Therefore, calculate the Micro-average (global measure over all the binary decisions where \textit{i} is the number of total test instances and \textit{m} is the number of categories in consideration)
%per each entity relation (\ref{Table:}).



















